25 research outputs found

    Learning a Macroscopic Model of Cultural Dynamics

    A fundamental open question, studied by sociologists since the 1970s and more recently taken up by the computer-science community, is understanding the role that influence and selection play in shaping the evolution of socio-cultural systems. Quantifying these forces in real settings remains a major challenge, especially at large scale, where the full social network between users may not be known and only longitudinal data on the masses of cultural groups (e.g., political affiliation, product adoption, market share, cultural tastes) may be available. We propose an influence and selection model with an explicit characterization of the feature space of the different cultural groups, in the form of a natural equation-based macroscopic model following the approach of Kempe et al. [EC 2013]. Our main goal is to estimate edge influence strengths and selection parameters from an observed time series. For an experimental evaluation, we learn these parameters on real datasets from Last.fm and Wikipedia.
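
    To make the estimation task concrete, the following sketch (assuming, purely for illustration, a linear macroscopic update x(t+1) ≈ A x(t) rather than the paper's actual equations) recovers pairwise influence strengths between groups from a time series of group masses by least squares.

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical ground truth: three cultural groups, row-stochastic influence matrix.
        A_true = np.array([[0.80, 0.15, 0.05],
                           [0.10, 0.85, 0.05],
                           [0.05, 0.10, 0.85]])

        # Simulate the observed time series of group masses (kept on the simplex).
        T = 200
        X = np.empty((T, 3))
        X[0] = [0.5, 0.3, 0.2]
        for t in range(1, T):
            x = A_true @ X[t - 1] + rng.normal(0, 0.005, size=3)   # small observation noise
            x = np.clip(x, 0, None)
            X[t] = x / x.sum()

        # Least-squares estimate: find A with X[t] ~ A X[t-1] over consecutive snapshots.
        B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)          # solves X[:-1] @ B ~ X[1:]
        A_hat = B.T
        print("estimated influence strengths:")
        print(A_hat.round(2))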

    Community Detection on Evolving Graphs

    Clustering is a fundamental step in many information-retrieval and data-mining applications. Detecting clusters in graphs is also a key tool for finding the community structure in social and behavioral networks. In many of these applications, the input graph evolves over time in a continual and decentralized manner, and, to maintain a good clustering, the clustering algorithm needs to repeatedly probe the graph. Furthermore, there are often limitations on the frequency of such probes, either imposed explicitly by the online platform (e.g., when crawling proprietary social networks like Twitter) or implicitly because of resource limitations (e.g., when crawling the web). In this paper, we study a model of clustering on evolving graphs that captures this aspect of the problem. Our model is based on the classical stochastic block model, which has been used to rigorously assess the quality of various static clustering methods. In our model, the algorithm must reconstruct the planted clustering, given the ability to query for small pieces of local information about the graph at a limited rate. We design and analyze clustering algorithms that work in this model, and show asymptotically tight upper and lower bounds on their accuracy. Finally, we perform simulations which demonstrate that our main asymptotic results also hold in practice.
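
    As a toy illustration of the setting (not the paper's algorithms or bounds), the sketch below plants a two-block stochastic block model whose hidden clustering slowly drifts, and maintains a clustering using only a few neighborhood probes per step by comparing each probed node's neighborhood with that of a fixed anchor node; the parameters and the probe accounting are illustrative.

        import random

        random.seed(0)
        n, p_in, p_out = 100, 0.40, 0.02
        truth = [v % 2 for v in range(n)]            # planted (hidden) clustering

        def probe(v):
            """One probe: reveal v's neighborhood under the current graph."""
            return {u for u in range(n) if u != v and
                    random.random() < (p_in if truth[u] == truth[v] else p_out)}

        anchor = 0                                   # reference node, assumed to stay put
        guess = [0] * n
        probes_per_step = 5
        for step in range(400):
            if step % 20 == 0:
                truth[random.randrange(1, n)] ^= 1   # the hidden clustering slowly drifts
            anchor_nbrs = probe(anchor)              # refresh the anchor's neighborhood
            for v in random.sample(range(1, n), probes_per_step):
                common = len(probe(v) & anchor_nbrs)
                guess[v] = 0 if common >= 3 else 1   # many common neighbors => same side as anchor

        agree = sum(g == t for g, t in zip(guess, truth)) / n
        print(f"agreement with planted clustering: {agree:.2f}")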

    Stochastic Query Covering for Fast Approximate Document Retrieval

    We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection such that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that, even though they use only a small fraction of the entire collection, they can answer most user queries, achieving performance close to optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analyses provide strong evidence of the potential value of query covering in diverse application scenarios.
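
    The following sketch illustrates the flavor of the selection problem, with a toy corpus and a much-simplified notion of "answering" a query (containing all of its terms): it greedily picks a small document subset that answers as many queries as possible from a sample drawn from the query distribution.

        documents = {
            "d1": {"jaguar", "car", "engine"},
            "d2": {"jaguar", "animal", "jungle"},
            "d3": {"car", "engine", "repair"},
            "d4": {"jungle", "animal", "safari"},
        }
        # A sample from the query distribution (repeated queries reflect popularity).
        query_sample = [
            {"jaguar", "car"}, {"jaguar", "car"}, {"jaguar", "animal"},
            {"engine", "repair"}, {"animal", "jungle"}, {"animal", "jungle"},
        ]

        def answered(queries, chosen):
            """Number of sampled queries fully contained in at least one chosen document."""
            return sum(any(q <= documents[d] for d in chosen) for q in queries)

        budget, chosen = 2, set()
        while len(chosen) < budget:
            best = max(documents.keys() - chosen,
                       key=lambda d: answered(query_sample, chosen | {d}))
            chosen.add(best)

        print("selected subset:", sorted(chosen))
        print(f"queries answerable from the subset: {answered(query_sample, chosen)}/{len(query_sample)}")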

    Probabilistic Techniques in the Analysis of Dynamic Processes

    In this thesis we study some stochastic problems related to networks and search engines. We study systems in a dynamic setting where input is continually injected and processed by the algorithm (protocol). The goal is the design and analysis of protocols that are stable and efficient in the long run. We first consider the problem of routing calls on telephone or ATM networks, for which we present a routing protocol and compare it with the one that is commonly used (Dynamic Alternative Routing). We prove that, under a standard input model (Poisson arrivals, exponential durations), our protocol has an exponentially smaller bandwidth requirement than the traditional approach. Next we study the problem of load balancing on networks, where requests are continually created and serviced. We analyze a protocol under a variety of input models and prove bounds on the expected load and the expected waiting time of a new request as time passes. Subsequently, we address a problem related to search engines. We introduce the problem of sampling results from search-engine queries, which has applications in mining the results and offering services to the user. We present algorithms for the problem and analyze their running time and the quality of the results they obtain. We supplement the analysis with several experiments.

    In modeling dynamic phenomena as stochastic processes, we place stochastic assumptions on the stream of inputs to the system. The process that generates this stream might be stationary, periodic, or even bursty; the goal is to obtain results that are valid under the weakest set of assumptions. To this end, we develop and apply various mathematical tools: in our analyses we use Markov processes, queueing theory, renewal theory, and martingale or martingale-like processes, which enable us to handle the dependencies between the quantities that appear in our systems.
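
    As a small illustration of the standard input model mentioned above (Poisson arrivals, exponential durations), and not of the routing or load-balancing protocols themselves, the sketch below simulates call arrivals and checks that the long-run average number of calls in progress matches the M/M/∞ prediction λ/μ.

        import random

        random.seed(0)
        lam, mu, horizon = 5.0, 1.0, 10_000.0

        # Generate Poisson call arrivals with exponentially distributed durations.
        events = []                                  # (time, +1) call starts, (time, -1) call ends
        t = 0.0
        while True:
            t += random.expovariate(lam)
            if t >= horizon:
                break
            duration = random.expovariate(mu)
            events.append((t, +1))
            events.append((t + duration, -1))

        # Sweep the events to compute the time-average number of calls in progress.
        events.sort()
        active, prev_time, area = 0, 0.0, 0.0
        for time, delta in events:
            area += active * (time - prev_time)
            active += delta
            prev_time = time

        print(f"time-average calls in progress: {area / prev_time:.2f}  (theory lambda/mu = {lam / mu:.2f})")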

    Steady State Analysis of Balanced-Allocation Routing

    We compare the long-term, steady-state performance of a variant of the standard Dynamic Alternative Routing (DAR) technique commonly used in telephone and ATM networks with the performance of a path-selection algorithm based on the "balanced-allocation" principle [Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal, SIAM J Comput 29(1) (2000), 180-200; M. Mitzenmacher, Ph.D. Thesis, University of California, Berkeley, August 1996]; we refer to this new algorithm as the Balanced Dynamic Alternative Routing (BDAR) algorithm. While DAR checks alternative routes sequentially until available bandwidth is found, the BDAR algorithm compares a small number of alternatives and chooses the best. We show that, at the expense of a minor increase in routing overhead, the BDAR algorithm gives a substantial improvement in network performance, both in terms of network congestion and bandwidth requirement.
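
    The sketch below illustrates the balanced-allocation principle behind BDAR in the simplest possible setting, with bins standing in for alternative routes: it contrasts a single random choice per request with choosing the least-loaded of d random alternatives. It is only a caricature of the actual DAR/BDAR protocols and network model.

        import random

        random.seed(0)
        n_routes, n_requests, d = 1000, 1000, 2

        # One random alternative per request (a caricature of first-available routing).
        single = [0] * n_routes
        for _ in range(n_requests):
            single[random.randrange(n_routes)] += 1

        # Least-loaded of d random alternatives per request (the balanced-allocation rule).
        best_of_d = [0] * n_routes
        for _ in range(n_requests):
            candidates = [random.randrange(n_routes) for _ in range(d)]
            target = min(candidates, key=lambda r: best_of_d[r])
            best_of_d[target] += 1

        print("maximum load with one choice:         ", max(single))
        print(f"maximum load with best of {d} choices: ", max(best_of_d))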

    Bidding Strategies for Fantasy-Sports Auctions

    Fantasy sports is a fast-growing, multi-billion dollar industry [10] in which competitors assemble virtual teams of athletes from real professional sports leagues and obtain points based on the statistical performance of those athletes in actual games. Users (team managers) can add, drop, and trade players throughout the season, but the pivotal event is the player draft that initiates the competition. One common drafting mechanism is the so-called auction draft: managers bid on athletes in rounds until all positions on each roster have been filled. Managers start with the same initial virtual budget and take turns successively nominating athletes to be auctioned, with the winner of each round making a virtual payment that diminishes his budget for future rounds. Each manager tries to obtain players that maximize the expected performance of his own team. In this paper we initiate the study of bidding strategies for fantasy sports auction drafts, focusing on the design and analysis of simple strategies that achieve good worst-case performance, obtaining a constant fraction of the best value possible, regardless of competing managers’ bids. Our findings may be useful in guiding bidding behavior of fantasy sports participants, and perhaps more importantly may provide the basis for a competitive auto-draft mechanism to be used as a bidding proxy for participants who are absent from their league’s draft. © Springer-Verlag GmbH Germany 201
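
    To make the auction-draft mechanics concrete, the sketch below runs a toy draft in which every manager bids according to a hypothetical rule (a fraction of remaining budget proportional to the athlete's projected value); the rule, rosters, and values are made up for the example and are not the strategies analyzed in the paper.

        import random

        random.seed(0)
        athletes = {f"athlete_{i}": random.uniform(10, 100) for i in range(12)}   # projected points
        budgets = {"A": 200.0, "B": 200.0, "C": 200.0}
        rosters = {m: [] for m in budgets}
        roster_size = len(athletes) // len(budgets)
        total_value = sum(athletes.values())

        # Athletes are nominated in decreasing order of projected value.
        for name, value in sorted(athletes.items(), key=lambda kv: -kv[1]):
            bids = {}
            for m in budgets:
                if len(rosters[m]) >= roster_size:
                    continue                        # roster already full, cannot bid
                # hypothetical rule: spend in proportion to the athlete's share of total value
                bids[m] = min(budgets[m], budgets[m] * value / total_value * roster_size)
            winner = max(bids, key=bids.get)
            price = bids[winner]
            budgets[winner] -= price
            rosters[winner].append((name, round(price, 1)))

        for m in rosters:
            print(m, rosters[m], "budget left:", round(budgets[m], 1))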

    Approximation algorithms for co-clustering

    Co-clustering is the simultaneous partitioning of the rows and columns of a matrix such that the blocks induced by the row/column partitions are good clusters. Motivated by several applications in text mining, market-basket analysis, and bioinformatics, this problem has attracted significant attention in the past few years. Unfortunately, to date, most of the algorithmic work on this problem has been heuristic in nature. In this work we obtain the first approximation algorithms for the co-clustering problem. Our algorithms are simple and obtain constant-factor approximations to the optimum. We also show that co-clustering is NP-hard, thereby complementing our algorithmic result. Copyright 2008 ACM
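
    The sketch below only makes a co-clustering objective concrete, scoring a row/column partition by the squared deviation of each induced block from its block mean; it compares a planted partition with a random one and does not reproduce the paper's approximation algorithms.

        import numpy as np

        rng = np.random.default_rng(0)
        # A 20x20 matrix with a planted 2x2 block structure plus noise.
        M = np.kron(np.array([[5.0, 1.0], [1.0, 5.0]]), np.ones((10, 10)))
        M += rng.normal(0, 0.3, M.shape)

        def cocluster_cost(M, rows, cols):
            """Sum of squared deviations of entries from their block means."""
            cost = 0.0
            for r in set(rows):
                for c in set(cols):
                    block = M[np.ix_(rows == r, cols == c)]
                    if block.size:
                        cost += ((block - block.mean()) ** 2).sum()
            return cost

        planted_rows = np.repeat([0, 1], 10)
        planted_cols = np.repeat([0, 1], 10)
        random_rows = rng.integers(0, 2, 20)
        random_cols = rng.integers(0, 2, 20)

        print("cost of the planted co-clustering:", round(cocluster_cost(M, planted_rows, planted_cols), 1))
        print("cost of a random co-clustering:   ", round(cocluster_cost(M, random_rows, random_cols), 1))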

    On Mining IoT Data for Evaluating the Operation of Public Educational Buildings

    Public educational systems operate thousands of buildings with vastly different characteristics in terms of size, age, location, construction, thermal behavior, and user communities. Their strategic planning and sustainable operation is extremely complex and requires quantitative evidence on the performance of buildings, such as the interaction between the indoor and outdoor environments. Internet of Things (IoT) deployments can provide the necessary data to evaluate, redesign, and eventually improve organizational and managerial measures. In this work a data-mining approach is presented to analyze the sensor data collected over a period of two years from an IoT infrastructure deployed across 18 school buildings in Greece, Italy, and Sweden. The real-world evaluation indicates that data mining on sensor data can provide critical insights to building managers and custodial staff about ways to lower a building's energy footprint through effectively managing building operations.
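
    The sketch below shows the kind of simple, actionable indicator such an analysis can produce, using synthetic hourly readings and hypothetical field names: the share of a building's energy consumption that falls outside school hours.

        import random
        from collections import defaultdict

        random.seed(0)
        # Synthetic readings: (building, hour_of_day, kWh); occupied hours draw extra load.
        readings = [(b, h, random.uniform(2, 6) + (4 if 8 <= h < 16 else 0))
                    for b in ("school_A", "school_B")
                    for day in range(30)
                    for h in range(24)]

        totals = defaultdict(float)
        off_hours = defaultdict(float)
        for building, hour, kwh in readings:
            totals[building] += kwh
            if not (8 <= hour < 16):
                off_hours[building] += kwh

        for building in totals:
            share = off_hours[building] / totals[building]
            print(f"{building}: {share:.0%} of consumption falls outside school hours")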

    Effective and efficient classification on a search-engine model

    Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class, using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of these techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that in our set-up the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query. © Springer-Verlag London Limited 2007
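
    The sketch below illustrates the idea of building a short class query from both positively and negatively correlated terms, using a toy corpus and a deliberately simple correlation score; it is not the feature-selection procedure evaluated in the paper.

        from collections import Counter

        docs = [
            ({"goal", "match", "league", "coach"}, "sports"),
            ({"match", "player", "injury", "goal"}, "sports"),
            ({"election", "vote", "policy", "coach"}, "politics"),
            ({"vote", "party", "policy", "league"}, "politics"),
        ]
        target = "sports"

        pos, neg, n_target = Counter(), Counter(), 0
        for terms, label in docs:
            if label == target:
                n_target += 1
                pos.update(terms)
            else:
                neg.update(terms)

        # Score = (fraction of target docs containing t) - (fraction of other docs containing t).
        vocabulary = set().union(*(terms for terms, _ in docs))
        n_other = len(docs) - n_target
        score = {t: pos[t] / n_target - neg[t] / n_other for t in vocabulary}

        best_pos = sorted(score, key=score.get, reverse=True)[:2]   # most positively correlated terms
        best_neg = sorted(score, key=score.get)[:1]                 # most negatively correlated term
        query = " ".join(["+" + t for t in best_pos] + ["-" + t for t in best_neg])
        print("class query:", query)    # e.g. "+goal +match -vote"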

    Peer and authority pressure in information-propagation models

    Existing models of information diffusion assume that peer influence is the main reason for the observed propagation patterns. In this paper, we examine the role of authority pressure on the observed information cascades. We model this intuition by characterizing some nodes in the network as "authority" nodes: nodes that can influence a large number of peers while not being influenced by peers themselves. We propose a model that associates with every item two parameters that quantify the impact of peer and authority pressure on the item's propagation. Given a network and the observed diffusion patterns of an item, we learn these parameters from the data and characterize the item as peer- or authority-propagated. We also develop a randomization test that evaluates the statistical significance of our findings and makes our item characterization robust to noise. Our experiments with real data from online media and scientific-collaboration networks indicate that there is a strong signal of authority pressure in these networks. © 2011 Springer-Verlag
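
    As a toy illustration of the two forces (not the paper's model or learning procedure), the sketch below simulates a cascade in which each new activation is attributed either to an authority node or to an already-active peer, and characterizes the item by the share of authority-driven activations; the parameters and the attribution rule are made up for the example.

        import random

        random.seed(0)
        n, n_authorities = 200, 5
        alpha, beta = 0.4, 0.05                      # authority and peer adoption probabilities
        authorities = set(range(n_authorities))

        active = set(authorities)                    # authorities start active and cannot be influenced
        frontier = set(authorities)
        from_authority = from_peer = 0
        while frontier:
            newly_active = set()
            for v in range(n):
                if v in active:
                    continue
                hit_by_authority = any(random.random() < alpha
                                       for u in frontier if u in authorities)
                hit_by_peer = any(random.random() < beta
                                  for u in frontier if u not in authorities)
                if hit_by_authority:
                    newly_active.add(v)
                    from_authority += 1
                elif hit_by_peer:
                    newly_active.add(v)
                    from_peer += 1
            active |= newly_active
            frontier = newly_active

        total = from_authority + from_peer
        print(f"authority-driven share of activations: {from_authority / max(total, 1):.0%}")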